Abstract

Previous statistical analyses have shown that amino acid sites in a protein evolve in a correlated way instead of 
independently. Even though located distantly in the linear sequence, the coevolved amino acids could be spatially adjacent 
in the tertiary structure, and constitute specific protein sectors. Moreover, these protein sectors are independent of one 
another in structure, function, and even evolution. Thus, systematic studies on protein sectors inside a protein will 
contribute to the clarification of protein function. In this paper, we propose a new algorithm, BIFANR (Bi-factor Analysis
Based on Noise-reduction), for detecting protein sectors in amino acid sequences. After applying BIFANR to the S1A family and
the PDZ family, we carried out an internal correlation test, a statistical independence test, evolutionary rate analysis, evolutionary
independence analysis, and function analysis to assess the predictions. The results showed that the amino acids within each
predicted protein sector are closely correlated in structure, function, and evolution, while different protein sectors are nearly
statistically independent. The results also indicated that the protein sectors have distinct evolutionary directions. In addition,
compared with other algorithms, BIFANR has higher accuracy and robustness under the influence of noise sites.
Introduction

Amino acid coevolution is very common in various protein
families [1,2,3]. Highly conserved amino acid sites are often
located in the core or on the functional surface of the protein tertiary
structure [4,5,6]. These sites are usually under strong evolutionary
constraint and are thus critical for maintaining the protein's function.
The amino acid sites that are highly correlated in evolution often
form protein sectors [7,8]. Protein sectors decompose proteins into
quasi-independent groups, which are distinct from the traditional
hierarchy of protein structure. Statistical analysis
of the cooperative action of conserved amino acids could
contribute to the inference of protein function and evolution [1,9].

Since functionally important amino acid regions in a protein are 
usually conserved in evolution, researchers have been identifying 
these regions by performing directed mutagenesis experiments 
[10,11,12,13]. However, such experimental approaches are time 
and labor intensive. To overcome this problem,
researchers have developed statistical methods to detect functionally
dependent (or correlated) amino acids in proteins using
coevolution analysis [14,15]. For example, some parametric and
non-parametric methods were employed to detect important
amino acid sites [16,17,18,19,20,21,22]; these usually focus on
amino acids important for maintaining the protein structure and
function. These methods rely on multiple sequence alignment
(MSA), so the quality and size of the MSA and the background
coevolution noise became the main obstacles [15,18,23]. In
addition, some other typical probabilistic models have also been
implemented, e.g. maximum likelihood approximation
[24,25,26], Bayesian probabilities [27], phylogenetic approaches
[28] and sequence divergence based approximation [15,29].
Lastly, several new ideas were introduced to reduce the influence
of noise [7,8]. However, these methods reach high
accuracy only in some specific protein families and thus cannot be widely
used. Therefore, a more effective method is needed.

In this paper, we propose a new algorithm, named BIFANR (Bi-factor
analysis based on noise-reduction), to reveal the coevolving
pattern of amino acid sites. The algorithm originates from
Factor Analysis in psychological research [30], which is widely
used to analyze psychological factors, such as human personality
and sensibility. Like previous studies, our algorithm follows
these principles: 1) the coevolved amino acid sites in a protein
constitute a protein sector, whose sites are closely associated in the
tertiary structure and account for certain biological characteristics;
2) different protein sectors are independent of each other in terms
of tertiary structure and function. However, unlike
other methods, BIFANR conducts noise reduction before
factor analysis, which improves efficiency and accuracy. After that,
a bi-factor analysis is employed to determine the corresponding



PLOS ONE | www.plosone.org 






November 2013 | Volume 8 | Issue 11 | e79764 



Bi-Factor Analysis Based on Noise-Reduction 



[Figure 1 flowchart, steps as labeled: Template sequences; Multiple sequence alignment; Data pre-processing (construction of weighted covariance matrix, reduction of noise sites); Bi-factor analysis (determination of the number of eigenvectors, rotation of the linear combination of selected eigenvectors, two-way selection of protein sectors, merging of protein sectors); Examination of protein sectors (correlation between amino acids in sectors, statistical independence, calculation of evolutionary rates, evolutionary independence); Function analysis of protein structures; Results analysis.]




Figure 1. Methods flowchart. 
eigenvectors of non-random eigenvalues with a stochastic simulation
and then to extract protein sectors. In the linear combination of
eigenvectors, this algorithm employs varimax orthogonal rotation
to ensure independence between protein sectors. Furthermore, we
applied BIFANR to the PDB structure 3TGI of the S1A serine
protease family and 1BE9 of the PSD95/Dlg1/ZO1 (PDZ) family.
As a result, we found 3 protein sectors in 3TGI and 2 in 1BE9.
Further analysis showed that BIFANR has higher accuracy and
robustness compared with other algorithms. The flowchart of the
complete analyses is presented in Figure 1. The source code of the
BIFANR program is available in the file Program S1.

Results 

1. The Algorithm Design of BIFANR 

The BIFANR algorithm consists of two major components, 
detailed as follows: 

1.1 Correlation coefficient matrix and noise
reduction. This algorithm applies the idea of Factor Analysis
to amino acid site analysis to extract protein sectors. Specifically,
starting from a given MSA, we first calculated the correlation
coefficients between amino acid sites and constructed a covariance
matrix (the non-weighted correlation coefficient matrix, see Methods).
Considering the biological significance, we then weighted
the covariance matrix as in previous studies [7,31]. Finally, we
calculated the weighted correlation coefficient matrix based on the
background frequency of the 20 amino acids and the conservation
of amino acid sites. This matrix measures the pair-wise
correlation of amino acid sites, and we
conducted noise reduction based on it.

The noise sites are amino acid sites that are weakly correlated
with almost all the other sites. These noise sites usually reduce the
efficiency and accuracy with which the algorithm identifies protein sectors,
and they are the main reason that some methods for detecting covarying
amino acid sites fail [32]. To overcome this
problem, we developed a noise reduction method in the pre-processing
step. Specifically, the amino acid sites with high
randomness in evolution are removed before the detection step.

Taking the S1A family and the PDZ family as examples: in the S1A
family, there were 223 sites in the multiple sequence alignment
(MSA) of 3TGI [33] and its homologous protein sequences, and
104 sites were removed by the noise reduction step; in the PDZ
family, 49 of the total 94 sites were removed by the noise
reduction step. These removed sites are weakly correlated with
other sites, and have higher evolutionary rates than the remaining
sites. As calculated by Rate4Site [34], in the S1A family the average
evolutionary rates of the removed and remaining sites are 0.7692 and
-0.6723, respectively. In the PDZ family they are 0.6717 and
-0.7314, respectively.

1.2 Bi-factor analysis. In the bi-factor analysis, we obtained
protein sectors according to the eigenvectors of the weighted
correlation coefficient matrix. To guarantee the non-randomness
of the predicted protein sectors, we simulated the data
by randomly shuffling the multiple sequence alignment 100
times, and then chose the non-random eigenvectors of the
correlation coefficient matrix based on the stochastic simulation
result. Using this method, we can find the non-random protein
sectors hidden in the protein sequence. The original factor
coefficient for each amino acid site can be considered as the
correlation between the site and the factor (i.e. a selected eigenvector
of the correlation coefficient matrix, see Methods). BIFANR
assigns amino acid sites to factors according to this correlation.
However, we cannot obtain protein sectors from the original
factor coefficients directly, because one site may have similar
coefficients with different factors. We therefore conducted
varimax orthogonal rotation on these factors. Ideally,
each site will have a large coefficient value with just one
factor, which is sufficient to distinguish this factor from the
remaining factors. Consequently, the protein sectors detected by
BIFANR will have significant statistical independence.
Figure 2. Average correlation coefficients of protein sectors. A: Average correlation coefficient of each protein sector in S1A family. B:
Average correlation coefficient of each protein sector in PDZ family. Red column represents average correlation coefficient of protein sector 1. Blue
column represents average correlation coefficient of protein sector 2. Green column represents average correlation coefficient of protein sector 3.
Black column represents stochastic expected average correlation coefficient of each protein sector.
doi:10.1371/journal.pone.0079764.g002



After varimax orthogonal rotation, amino acid sites were
assigned to factors according to the factor coefficients calculated
above. As each coefficient lies within the range [-1, 1], there are both
positive and negative correlations, and the larger the absolute value,
the more significant the correlation. Sites that have a significant
negative correlation with one factor might therefore also
form a protein sector. BIFANR thus conducts bidirectional
selection of factor coefficients on the basis of factor analysis, which
prevents the loss of protein sectors that would result from selecting only
positive factor coefficients. However, bidirectional selection may
produce two overlapping protein sectors. To
merge overlapping protein sectors, we retain the overlap and
then add those sites having higher correlation with the retained sites.

2. Statistical and Biological Tests for Protein Sectors 

BIFANR detected three protein sectors in the S1A family and two
protein sectors in the PDZ family. To verify these protein sectors and
evaluate the performance of our algorithm, we conducted
statistical tests and biological analyses on these protein
sectors. The statistical tests include an internal correlation test and a
statistical independence test. In addition, we conducted an evolutionary
rate test.



To demonstrate the correlation between the amino acid sites
within a protein sector, we took all of the amino acid sites in each
protein sector and calculated the mean correlation coefficient
over all pairs of these sites. In addition, we
randomly simulated a data set with the same number of sites
1000 times and calculated the mean correlation
coefficients in the same way to obtain the random expectation. The results showed
that in each protein sector, the average correlation coefficient is
much higher than the random expectation (Figure 2).

To illustrate the statistical independence between protein
sectors, we calculated the MDI entropy for the S1A family and the PDZ
family, respectively. The MDI entropy was originally used to
quantify the degree to which a selected group of amino acid sites
are statistically coupled to each other in an MSA. In theory, if two protein
sectors are independent, the MDI entropy of the two taken together
should equal the sum of their MDI entropies taken
individually. The results supported this conjecture by
showing that the MDI entropy of each pair of predicted protein
sectors was much higher than the random expectation, which
means that the protein sectors are statistically independent of each
other (Figure 3).

To study the evolutionary feature of amino acid sites within a 
protein sector, we calculated the average evolutionary rates of the 
Figure 3. Statistical independence of protein sectors. A: Statistical independence of protein sectors in S1A family. B: Statistical independence
of protein sectors in PDZ family. Red column represents MDI entropy of protein sector 1. Blue column represents MDI entropy of protein sector 2.
Green column represents MDI entropy of protein sector 3. Black column represents MDI entropy of two protein sectors as a whole. Yellow column
represents stochastic expected MDI entropy after disrupting the amino acid sites within two protein sectors 100 times.
doi:10.1371/journal.pone.0079764.g003



amino acid sites in the entire protein and the amino acid sites in
the protein sectors, respectively. The results showed that the latter
was much lower than the former (Table 1). Figure 4 shows the
result for both the S1A family and the PDZ family: in both
families, the evolutionary rates of over 90% of the sites in protein sectors
are negative, suggesting that these sites evolve more slowly than average
and thus are selectively constrained to maintain the protein
structure and function.

3. Comparison to Buck's Method 

To evaluate the performance of BIFANR, we compared
our method with Buck's method [8]. For the comparison, we chose
3TGI of the S1A family and 1BE9 [35] of the PDZ family as
template sequences, since protein sectors in these two template
sequences have been experimentally verified [7]. Then we
evaluated the predicted results of the two methods by comparing


Table 1. Average evolutionary rate of each protein sector
and average evolutionary rate of all sites in S1A and PDZ.



experimentally confirmed sectors with the predicted sectors (i.e.
factors in [8]). If all or most sites of an experimentally confirmed
sector are found in just one predicted sector, the
prediction is considered reliable. Otherwise, if the sites are spread over several
predicted sectors, the prediction is considered unreliable. We
then calculated the sensitivity, i.e. the percentage of sites in experimentally
confirmed sectors that are found in the predicted sectors, and the positive
predictive value (PPV), i.e. the percentage of sites in the predicted sectors that
are true positives (sites in experimentally confirmed sectors).
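
The two measures can be illustrated with a minimal sketch. It assumes a site-level reading of the definitions (sets of residue positions); the function name and the toy positions are hypothetical and are not data from this study.

```python
def sensitivity_and_ppv(confirmed_sites, predicted_sites):
    """Site-level sensitivity and positive predictive value (assumed reading).

    confirmed_sites: positions in experimentally confirmed sectors
    predicted_sites: positions in predicted sectors
    """
    confirmed = set(confirmed_sites)
    predicted = set(predicted_sites)
    true_positives = confirmed & predicted
    sensitivity = len(true_positives) / len(confirmed)  # confirmed sites recovered
    ppv = len(true_positives) / len(predicted)          # predicted sites confirmed
    return sensitivity, ppv

# Toy example with made-up positions (illustration only)
sens, ppv = sensitivity_and_ppv({1, 2, 3, 4}, {2, 3, 4, 5, 6})
```

In this toy case three of the four confirmed positions are recovered and three of the five predicted positions are confirmed.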

For Buck's method, the sites in experimentally
confirmed sectors were distributed almost uniformly across different
predicted sectors. For our method, in contrast, the sites in any experimentally
confirmed sector fell in just one predicted sector (Figures 5 and
6). The sensitivities of Buck's method in the S1A
family and the PDZ family were 85.07% and 82.35%, respectively,
while those of our method were 91.04% and 94.11%. In addition,
the PPVs of Buck's method in the S1A family and the PDZ family were
43.84% and 29.16%, respectively, while those of our method were
90.77% and 94.11%. These results indicate that
BIFANR performs considerably better than Buck's method on these two families.



        sector 1    sector 2    sector 3    all sites
S1A     -0.7288     -0.4520     -1.2143     0
PDZ     -0.8683     -0.9163     -           0

doi:10.1371/journal.pone.0079764.t001



4. Function Analysis of Protein Structure 

BIFANR detected three and two protein sectors in the S1A family
and the PDZ family, respectively (see Table S1 for the amino acids in
each protein sector). Strikingly, the amino acid sites in the three
protein sectors of the S1A family are not linearly close to each other in
Figure 4. Distribution of amino acid site evolutionary rates. A: Distribution of amino acid site evolutionary rates in S1A family (rat trypsin:
3TGI). B: Distribution of amino acid site evolutionary rates in PDZ family (rat PSD-95: 1BE9). Red column represents the evolutionary rate distribution of
amino acid sites in protein sectors.
the sequence, but apparently are correlated in the tertiary
structure (Figure 7A-C). In addition, protein sectors tend to be
independent of each other in protein function.

In the S1A family, protein sector 1 mainly contains amino acids
surrounding the S1 pocket of the enzyme. Amino acid mutations
within this sector may affect the substrate specificity of some
enzymes in the family, as residues in this sector are involved in
transferring chymotryptic specificity into trypsin [36,37,38,39,40].
Protein sector 2 mainly contains amino acids of the two β-sheets in
the protein core. The double alanine mutation in this sector could
affect the thermal stability of the enzyme, but hardly affects the
catalytic ability. Moreover, the mutations in this sector are
synergistic. The effect of sector 1 on substrate specificity is
independent of that of sector 2 on structural stability [7,41,42].
Protein sector 3, which is mainly responsible for catalytic ability,
contains the catalytic triad (57H, 102D and 195S) and neighboring
amino acids that are related to catalytic ability or account for
allosteric regulation [36,43,44,45]. This sector also includes one
disulfide bond pair (42C-58C), and the substitution of this bond
would cooperatively interact with mutation of S195. In addition, the
triple mutation of C42A, C58A/V, and S195T will convert trypsin
from a serine protease to a threonine protease. This sector therefore
represents the catalytic core of this protease family.

In the PSD95/Dlg1/ZO1 (PDZ) protein domain family, protein
sector 1 contains amino acids in the α2-β2 groove and the α1-helix, which
affect the substrate binding affinity [31,46] and the regulation of
α2-β2 groove affinity [47]. The sites in this sector are either
directly related to each other or connected through
interactions with the substrate peptides. In protein sector 2,
the co-mutation of residues 36 and 75 to cysteine may be responsible for
the redox-dependent equilibrium [48,49] between two conformations
in INAD PDZ5 [50] (Figure 8).

To further investigate the function and independent evolution
of protein sectors, we analyzed the evolutionary independence of
the three protein sectors in the S1A family. The evolutionary independence
test constructs a similarity matrix M from the sequence
similarities of the amino acid sites in a protein sector and then conducts
principal component analysis. In the principal component analysis,
only one principal component was selected and all sequences were
separated into two parts according to their factor coefficients. Taking
the S1A family as an example, the results of the principal component
analysis of protein sectors 1, 2 and 3 are displayed in Figure 9A,
9B and 9C, respectively. According to the sequence similarity of
sites in each protein sector, casein (red, top) and chymotrypsin
(blue, below) proteins are separated by protein sector 1 (Figure 9A);
vertebrates and non-vertebrates are separated by protein sector 2
(Figure 9B); and enzymes and non-enzymes are separated by
protein sector 3 (Figure 9C). These results indicate that protein
sector 1 may be responsible for the specificity of substrate
recognition in the catalytic process, protein sector 2 may be
involved in the protein backbone evolution, and protein sector 3
may account for the protein catalytic activity.
Figure 5. Comparison between BIFANR and Buck's in S1A family. A: S1, S2, and S3 represent 3 experimentally confirmed protein sectors in S1A
family and the height of each color bar represents the number of sites in the corresponding predicted protein sector by Buck's method. B: S1, S2, and S3
represent 3 protein sectors in S1A family and the height of each color bar represents the number of sites in the corresponding predicted sector by
BIFANR. The height of the brown bar represents the number of sites lost by the algorithms in each experimentally confirmed protein sector.
5. Application to Hsp70/110 Family and G Protein Family

To demonstrate the generality of the BIFANR algorithm, we
carried out protein sector prediction for another 2 protein families:
the Hsp70/110 family and the G protein family. For the Hsp70/110
family, the MSA consists of 926 sequences and 605 positions, and
for the G protein family the MSA consists of 678 sequences and
160 positions. BIFANR detected 2 significant protein sectors for
each of the two families (see Figure 10 and Table S2). The
internal correlation test and the statistical independence test for
both datasets showed that the conclusions drawn from the
experiments on the Hsp70/110 family and the G protein family were
also well supported (Figures S1 and S2).

Discussion 

Exploring the coevolved protein sectors among homologous
proteins is currently an active research topic, especially for studies of the
biological features and evolutionary direction of proteins. A
few methods have been developed for detecting coevolved sites
in a protein family, but they all suffer from low accuracy and low
robustness. In this paper, we proposed a new algorithm, BIFANR,
to address these issues.

BIFANR is unique in the following aspects. First, BIFANR has
a noise reduction step for the sites in the MSA. This step reduces
the complexity of the calculation and improves the accuracy.
Second, motivated by factor analysis, a stochastic simulation step is
adopted to choose non-random eigenvectors. This step ensures
that the protein sectors detected are non-random and thus of high
credibility. Third, BIFANR uses varimax orthogonal rotation to
calculate the linear combination of selected eigenvectors, which
leads to significant statistical independence between protein
sectors. Fourth, the algorithm avoids manual curation, such as
visual inspection and screening, and is thus more practical for
high-throughput analysis.

In addition, BIFANR is robust to the data scale. When the
data are randomly reduced to half their size, the result remains almost
the same. We did this for both the S1A and PDZ families and compared
the new results with the original ones. As shown in Table 2 and Table 3,
the accuracy remained high, especially for the PDZ family, which
indicates that BIFANR is robust to the data scale.

In the future, we will consider using a common amino acid
substitution matrix (e.g. PAM or BLOSUM) to incorporate the
relationships among amino acids, as BIFANR currently assumes
that all 20 amino acids are independent. In addition, we will
work with biologists to use our predicted sectors to guide site-specific
mutagenesis experiments on selected genes of
interest.

Materials and Methods 

1. Obtaining Materials 

In this study, we chose the classic S1A serine protease family
and the PDZ family for protein sector analysis (see Data S1.zip), as R.
Ranganathan did in a previous study [7]. The members of the S1A
family share the same peptide bond hydrolysis mechanism and
possess a broad substrate spectrum. The PDZ family is a common
Figure 6. Comparison between BIFANR and Buck's in PDZ family. A: S1 and S2 represent 2 experimentally confirmed protein sectors in PDZ
domain in signaling proteins, widely existing in bacteria, fungi, plants,
animals, and viruses [51,52,53,54], which mediates the protein-protein
interaction between the α2-β2 groove and the C-terminal
ligand of the target protein. The dataset was obtained through PSI-BLAST
[55] from the NCBI (release 2.2.14, May-07-2006) non-redundant
database with 3TGI and 1BE9 as the template
sequences, and the multiple sequence alignment was produced
with Clustal X [7,56].



2. Construction of Weighted Covariance Matrix and 
Reduction of Noise Sites 

2.1 Construction of covariance matrix. Proteins consist of
20 common amino acids. For the purpose of calculation, we
encode the 20 common amino acids as the numbers 1-20 and a gap as
0.

BIFANR constructs the covariance matrix by the formula:

$$c_{ij}^{(ab)} = f_{ij}^{(ab)} - f_i^{(a)} f_j^{(b)} \qquad (1)$$






Figure 7. Protein sectors of S1A family (Rat trypsin PDB: 3TGI). A: Red balls represent protein sector 1, which mainly comprises residues
located near the catalytic cleft. B: Blue balls represent protein sector 2, which comprises residues within the core of the two β barrels. C: Green balls
represent protein sector 3, comprising residues within the catalytic cleft. Residues comprising protein sectors are displayed in space filling
representation with a van der Waals surface.
doi:10.1371/journal.pone.0079764.g007
Figure 8. Protein sectors of PDZ family (Rat PSD-95 PDB: 1BE9). A: Red balls represent residues within protein sector 1. B: Blue balls represent
residues within protein sector 2. Residues comprising protein sectors are displayed in space filling representation with a van der Waals surface.
doi:10.1371/journal.pone.0079764.g008



in which $f_i^{(a)}$ is the observed frequency of the amino acid a at
position i and $f_{ij}^{(ab)}$ represents the joint frequency of having a at
position i and b at position j.
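
As an illustration of the covariance between the joint frequency and the product of the single-site frequencies, the following minimal sketch computes it for a toy alignment. It is a sketch under stated assumptions rather than the BIFANR implementation: the function name `frequency_covariance` is hypothetical, sequences are weighted equally, and the gap (code 0) is treated as an additional state.

```python
import numpy as np

def frequency_covariance(msa, n_states=21):
    """Joint frequency minus product of single-site frequencies.

    msa: integer array (n_seqs, n_sites); 0 = gap, 1..20 = amino acids.
    Returns c with c[i, j, a, b] = f_ij^(ab) - f_i^(a) * f_j^(b).
    """
    msa = np.asarray(msa)
    n_seqs = msa.shape[0]
    # One-hot encoding: x[s, i, a] = 1 if sequence s has state a at site i
    x = np.eye(n_states)[msa]
    f_single = x.mean(axis=0)                            # f_i^(a)
    f_joint = np.einsum("sia,sjb->ijab", x, x) / n_seqs  # f_ij^(ab)
    return f_joint - np.einsum("ia,jb->ijab", f_single, f_single)

# Toy alignment (made up): sites 0 and 1 covary, site 2 is independent of site 0
toy_msa = np.array([[1, 2, 3],
                    [1, 2, 4],
                    [2, 1, 3],
                    [2, 1, 4]])
cov = frequency_covariance(toy_msa)
```

For the coupled pair, `cov[0, 1, 1, 2]` is 0.5 - 0.5 * 0.5 = 0.25, whereas for the independent pair `cov[0, 2, 1, 3]` is 0.25 - 0.25 = 0.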

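BIFANR constructs the weighted covariance matrix C by the
formulas:

$$\tilde{c}_{ij}^{(ab)} = \phi_i^{(a)} \phi_j^{(b)} c_{ij}^{(ab)}, \qquad \phi_i^{(a)} = \ln\left[\frac{f_i^{(a)}\,(1 - q^{(a)})}{(1 - f_i^{(a)})\,q^{(a)}}\right] \qquad (2)$$

$$C_{ij} = \sqrt{\sum_{a,b} \left(\tilde{c}_{ij}^{(ab)}\right)^2} \qquad (3)$$

where $c_{ij}^{(ab)}$ is the covariance of Eq. (1), $q^{(a)}$ represents the background frequency of amino acid a, and C
is the weighted covariance matrix. The more closely related
sites i and j are, the higher the probability that they change together and the
larger the correlation coefficient; conversely, for unrelated sites the correlation
coefficient is small.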

2.2 Reduction of noise sites. In this study, we assume that
every two sites in one protein sector are significantly correlated
and thus have a large correlation coefficient. For each site i, we take the top
5% of sites that have the largest correlation coefficients with i,
calculate the average of these coefficients, and denote this average by
Rmax(i). Then we average Rmax(i) over all sites i and denote the
result by plus. If site i belongs to a protein
sector, its Rmax(i) should be larger than 0.5plus. Thus, a site whose
Rmax is no larger than 0.5plus is considered a noise site. Finally, we
remove the rows and columns of the noise sites from C and denote the
new matrix by C.
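
The noise-reduction rule can be sketched as follows. This is an illustrative sketch, not the released program: the function name `remove_noise_sites` and its parameter names are hypothetical, and excluding each site's correlation with itself is an assumption that the text does not state.

```python
import numpy as np

def remove_noise_sites(C, top_fraction=0.05, cutoff=0.5):
    """Drop sites whose Rmax is no larger than cutoff * mean(Rmax)."""
    C = np.asarray(C, dtype=float)
    n = C.shape[0]
    # Number of partner sites averaged per site (top 5%, at least one)
    k = max(1, int(round(top_fraction * (n - 1))))
    off_diag = C.copy()
    np.fill_diagonal(off_diag, -np.inf)   # assumption: ignore self-correlation
    # Rmax(i): mean of the k largest correlations between site i and other sites
    r_max = np.sort(off_diag, axis=1)[:, -k:].mean(axis=1)
    keep = np.flatnonzero(r_max > cutoff * r_max.mean())
    return C[np.ix_(keep, keep)], keep

# Toy matrix (made up): sites 0-3 are strongly correlated, sites 4-5 are noise
C = np.full((6, 6), 0.01)
C[:4, :4] = 0.9
np.fill_diagonal(C, 1.0)
reduced, kept = remove_noise_sites(C)
```

In the toy matrix, sites 4 and 5 have an Rmax of 0.01, far below half of the mean Rmax, so only sites 0 to 3 are retained.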

3. Bi-factor Analysis 

3.1. Selection of eigenvectors. In this study, we selected
eigenvectors by the following steps. (1) Calculate the eigenvalues
$\Lambda = \{\lambda_1, \lambda_2, \ldots, \lambda_n\}$ of the matrix C in descending order, where n is
the number of amino acid sites. (2) Scramble each column of the
original alignment randomly and independently, and denote the
new random alignment by A1. Then calculate the eigenvalues
for A1. After repeating this randomization 100 times, we obtain 100n
eigenvalues and put them into the set E. (3) For each i, count the number of
eigenvalues in E that are bigger than $\lambda_i$ and denote this
number by M(i). Obviously, M(i) is in increasing order for i = 1,
..., n. (4) Let M(t) be the last one less than 100 in {M(i); i = 1, ..., n}
and take the eigenvectors corresponding to $\lambda_1, \ldots, \lambda_t$.
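
The four steps can be sketched as below. This is a minimal illustration under assumptions: `select_eigenvectors` and `abs_correlation` are hypothetical names, the threshold on M(i) is taken to be the number of shuffles, and the matrix function used in the demonstration is a simple stand-in (absolute Pearson correlation of the numeric codes), not the weighted matrix C of this paper.

```python
import numpy as np

def select_eigenvectors(msa, matrix_fn, n_shuffles=100, seed=0):
    """Keep eigenvectors whose eigenvalues exceed the shuffled-alignment spectrum."""
    rng = np.random.default_rng(seed)
    msa = np.asarray(msa)
    eigvals, eigvecs = np.linalg.eigh(matrix_fn(msa))
    order = np.argsort(eigvals)[::-1]                 # step (1): descending order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    random_eigvals = []                               # step (2): the set E
    for _ in range(n_shuffles):
        shuffled = np.column_stack([rng.permutation(col) for col in msa.T])
        random_eigvals.append(np.linalg.eigvalsh(matrix_fn(shuffled)))
    random_eigvals = np.concatenate(random_eigvals)

    # Step (3): M(i) = number of random eigenvalues larger than the i-th observed one
    M = np.array([(random_eigvals > lam).sum() for lam in eigvals])
    # Step (4): t is the last index with M(i) below the number of shuffles
    below = np.flatnonzero(M < n_shuffles)
    t = below[-1] + 1 if below.size else 0
    return eigvals[:t], eigvecs[:, :t]

def abs_correlation(msa):
    # Stand-in matrix for the demonstration only (not the paper's weighting)
    return np.abs(np.corrcoef(msa.T.astype(float)))

# Toy alignment (made up): sites 0-2 are identical, sites 3-5 are random
rng = np.random.default_rng(1)
base = rng.integers(1, 21, size=200)
toy = np.column_stack([base, base, base] +
                      [rng.integers(1, 21, size=200) for _ in range(3)])
vals, vecs = select_eigenvectors(toy, abs_correlation)
```

Shuffling each column independently preserves the per-site composition while destroying correlations between sites, so the eigenvalue produced by the three coupled toy sites stands well above the shuffled spectrum.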

3.2 Rotation of the linear combination of
eigenvectors. We conduct varimax orthogonal rotation, which
maximizes [57]:







Figure 9. Evolutionary independence of protein sectors in S1A family. A: Evolutionary independence of protein sector 1. B: Evolutionary
independence of protein sector 2. C: Evolutionary independence of protein sector 3.
doi:10.1371/journal.pone.0079764.g009




Figure 10. Protein sectors of G protein family (A and B) and Hsp70/110 family (C and D). Red balls and Green balls represent different
protein sectors in protein 3D structure. Residues comprising protein sectors are displayed in space filling representation with a van der Waals surface.
doi:10.1371/journal.pone.0079764.g010



$$V_k = \left[\sum_{i=1}^{n} n\left(p_{ik}^2\right)^2 - \left(\sum_{j=1}^{n} p_{jk}^2\right)^2\right] / n^2, \qquad k = 1, \ldots, t \qquad (4)$$

where $p_{ik}$ is the i-th element of the k-th eigenvector after rotation.
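
A commonly used iterative scheme for varimax rotation, based on repeated singular value decompositions, can be sketched as follows. It is offered as an illustration of the criterion, not as the routine used in the BIFANR program; the function names and the toy loading matrix are hypothetical.

```python
import numpy as np

def varimax(loadings, max_iter=100, tol=1e-8):
    """Orthogonal varimax rotation of an (n_sites, t) loading matrix."""
    L = np.asarray(loadings, dtype=float)
    n, t = L.shape
    R = np.eye(t)                       # accumulated orthogonal rotation
    previous = 0.0
    for _ in range(max_iter):
        rotated = L @ R
        # Gradient of the varimax criterion with respect to the rotation
        grad = L.T @ (rotated**3 - rotated @ np.diag((rotated**2).sum(axis=0)) / n)
        u, s, vt = np.linalg.svd(grad)
        R = u @ vt                      # nearest orthogonal matrix to the gradient
        current = s.sum()
        if current - previous < tol * current:
            break
        previous = current
    return L @ R, R

def varimax_criterion(P):
    """V_k per column: variance of the squared coefficients."""
    n = P.shape[0]
    sq = P**2
    return (n * (sq**2).sum(axis=0) - sq.sum(axis=0)**2) / n**2

# Toy loadings (made up): a simple two-factor structure mixed by a 45-degree rotation
simple = np.array([[1.0, 0.0], [0.9, 0.0], [0.8, 0.1],
                   [0.0, 1.0], [0.1, 0.9], [0.0, 0.8]])
angle = np.pi / 4
mix = np.array([[np.cos(angle), -np.sin(angle)],
                [np.sin(angle), np.cos(angle)]])
mixed = simple @ mix
rotated, R = varimax(mixed)
```

After rotation each toy site loads mainly on one factor, which is the property used to assign sites to sectors.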

3.3 Bidirectional selection of protein sectors. We construct
protein sectors for each factor (eigenvector after rotation) as
follows. (1) Sort the factor coefficients of this factor in descending
order, select the top 50% and calculate their average,
denoted w. Then calculate the ratio r(i) = p(i)/w, where
p(i) is the coefficient between the factor and amino acid site i, i.e.



Table 2. The performance of BIFANR on original data and half
data in S1A family.

            Original data    Half data
sector 1    95%              90%
sector 2    92%              88%
sector 3    90.91%           90.91%

doi:10.1371/journal.pone.0079764.t002



the i-th element of the factor. (2) If the ratio r(i) is not smaller than
the given threshold γ and p(i) is bigger than the given threshold δ,
amino acid site i belongs to the protein sector. (3) In addition, if p(i) is
large enough, i.e. no smaller than the given threshold θ, then this
amino acid site also belongs to the protein sector.
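
The selection rule can be sketched as below. Several points are assumptions made for illustration: the threshold names and values are hypothetical (the text does not give them), and the negative direction is handled by applying the same rule to the sign-flipped coefficients.

```python
import numpy as np

def select_sector(coeffs, ratio_min, coeff_min, coeff_strong):
    """One-directional selection for a single rotated factor."""
    p = np.asarray(coeffs, dtype=float)
    top_half = np.sort(p)[::-1][: max(1, len(p) // 2)]
    w = top_half.mean()                       # mean of the top 50% coefficients
    r = p / w                                 # ratio r(i) = p(i) / w
    chosen = ((r >= ratio_min) & (p > coeff_min)) | (p >= coeff_strong)
    return np.flatnonzero(chosen)

def bidirectional_sectors(coeffs, ratio_min, coeff_min, coeff_strong):
    """Select sites with large positive and large negative coefficients."""
    p = np.asarray(coeffs, dtype=float)
    positive = select_sector(p, ratio_min, coeff_min, coeff_strong)
    # Assumption: the negative direction mirrors the rule on -p
    negative = select_sector(-p, ratio_min, coeff_min, coeff_strong)
    return positive, negative

# Toy coefficients (made up) and hypothetical thresholds
coeffs = [0.9, 0.8, 0.1, 0.05, -0.85, -0.7]
pos, neg = bidirectional_sectors(coeffs, ratio_min=1.0, coeff_min=0.3, coeff_strong=0.95)
```

With these toy values, sites 0 and 1 are selected in the positive direction and sites 4 and 5 in the negative direction.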

3.4 Merging of protein sectors. Bidirectional selection of
protein sectors may produce overlapping protein
sectors, which we merge into one as follows. (1) For two
overlapping protein sectors, we use a vector named Ssame to record the
overlapping sites and another vector Sdiff to record the symmetric set
difference. (2) Select from Sdiff the site for which the sum of
the correlation coefficients between that site and the sites in Ssame is the
largest, and move it into Ssame. (3) Repeat step (2) until Ssame reaches


Table 3. The performance of BIFANR on original data and half
data in PDZ family.

            Original data    Half data
sector 1    100%             100%
sector 2    90.91%           90.91%

doi:10.1371/journal.pone.0079764.t003






the size of the smaller protein sector before merging. Then we
output Ssame as the merged protein sector.
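
The merging procedure can be sketched as a greedy loop. The function name `merge_sectors` and the toy correlation values are hypothetical; the sketch follows the three steps as described and is not taken from the released program.

```python
import numpy as np

def merge_sectors(sector_a, sector_b, C):
    """Greedy merge of two overlapping sectors using correlation matrix C."""
    a, b = set(sector_a), set(sector_b)
    same = sorted(a & b)                 # Ssame: overlapping sites
    diff = sorted(a ^ b)                 # Sdiff: symmetric set difference
    target = min(len(a), len(b))         # size of the smaller sector
    while len(same) < target and diff:
        # Sum of correlations between each candidate and the sites in Ssame
        scores = [C[site, same].sum() for site in diff]
        same.append(diff.pop(int(np.argmax(scores))))
    return sorted(same)

# Toy correlation matrix (made up): site 3 is the candidate most correlated with the overlap
C = np.zeros((6, 6))
C[3, 1] = C[1, 3] = 0.8
C[3, 2] = C[2, 3] = 0.7
C[0, 1] = C[1, 0] = 0.2
merged = merge_sectors([0, 1, 2], [1, 2, 3, 4], C)
```

Here the overlap is sites 1 and 2, and site 3 is added because its summed correlation with the overlap (1.5) exceeds that of sites 0 and 4.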

4. Examination of Protein Sectors 

4.1 Correlations between amino acids in a protein
sector. We calculate the average correlation coefficient of each protein sector and the
corresponding random average correlation coefficient. We
use these two parameters to assess whether the correlation
between amino acids in a protein sector is significant.
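
The comparison can be sketched as follows, assuming that the random expectation is obtained from randomly drawn site sets of the same size as the sector; this reading, the function names and the toy matrix are assumptions made for illustration.

```python
import numpy as np

def mean_pairwise(C, sites):
    """Mean correlation coefficient over all pairs of the given sites."""
    sub = C[np.ix_(sites, sites)]
    upper = np.triu_indices(len(sites), k=1)   # each pair counted once
    return sub[upper].mean()

def random_expectation(C, size, n_trials=1000, seed=0):
    """Average of mean_pairwise over random site sets of the same size."""
    rng = np.random.default_rng(seed)
    n = C.shape[0]
    means = [mean_pairwise(C, rng.choice(n, size=size, replace=False))
             for _ in range(n_trials)]
    return float(np.mean(means))

# Toy matrix (made up): sites 0-3 form a tightly correlated group
C = np.full((10, 10), 0.1)
C[:4, :4] = 0.9
np.fill_diagonal(C, 1.0)
sector = [0, 1, 2, 3]
observed = mean_pairwise(C, sector)
expected = random_expectation(C, size=len(sector))
```

A sector is considered internally correlated when `observed` clearly exceeds `expected`, as it does for the toy group.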

4.2 Statistical independence. We use MDI entropy to
measure the degree to which a selected group of residues are
statistically coupled to each other in the multiple sequence
alignment. In this study, statistical independence is defined as
follows: if two protein sectors are independent, then the MDI
entropy of the two taken together must be the sum of their MDI
entropies taken individually [7,58]. We adopt the generalized iterative
scaling algorithm to calculate MDI entropy [59].

4.3 Calculation of the evolutionary rate. In this study, the
evolutionary rate is estimated by Rate4Site. Rate4Site uses the
maximum likelihood criterion to estimate the normalized rate of
evolution at each site, taking into consideration the topology and
branch lengths of the phylogenetic tree. Sites with positive
values evolve faster than average, and sites with negative values
evolve slower than average for that protein.

4.4 Evolutionary independence. To evaluate the evolutionary
independence of protein sectors, we use Principal Component
Analysis to separate proteins based on the sequence similarities of
the sites in a protein sector.
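
A minimal sketch of this separation is given below. It assumes that sequence similarity is the fraction of identical residues at the sector sites and that sequences are split by the sign of their score on the first principal component; both choices, and the function name, are assumptions made for illustration.

```python
import numpy as np

def split_by_first_component(msa, sector_sites):
    """Split sequences into two groups using the first principal component."""
    sub = np.asarray(msa)[:, sector_sites]
    # M[s, t]: fraction of sector sites at which sequences s and t are identical
    M = (sub[:, None, :] == sub[None, :, :]).mean(axis=2)
    centered = M - M.mean(axis=0, keepdims=True)
    # First principal component from the SVD of the column-centered matrix
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, 0] * s[0]
    return scores > 0                    # boolean group label per sequence

# Toy alignment (made up): two groups of sequences that differ at the sector sites
toy = np.array([[1, 1, 1, 1]] * 3 + [[2, 2, 2, 2]] * 3)
labels = split_by_first_component(toy, [0, 1, 2, 3])
```

The sign of a principal component is arbitrary, so only the grouping is meaningful, not which group is labeled True.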

The programs implementing the BIFANR algorithm can be obtained from Program
S1.

Supporting Information 

Figure S1 Average correlation coefficients of protein
sectors. A: Average correlation coefficient of each protein sector
in G protein family. B: Average correlation coefficient of each
protein sector in Hsp70/110 family. Blue and green columns
represent average correlation coefficient of protein sector 1 and